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ABSTRACT 

Thread-level parallelism in irregular applications with mu¬ 
table data dependencies presents challenges because the un¬ 
derlying data is extensively modified during execution of the 
algorithm and a high degree of parallelism must be real¬ 
ized while keeping the code race-free. In this article we de¬ 
scribe a methodology for exploiting thread parallelism for 
a class of graph-mutating worklist algorithms, which guar¬ 
antees safe parallel execution via processing in rounds of 
independent sets and using a deferred update strategy to 
commit changes in the underlying data structures. Scalabil¬ 
ity is assisted by atomic fetch-and-add operations to create 
worklists and work-stealing to balance the shared-memory 
workload. This work is motivated by mesh adaptation algo¬ 
rithms, for which we show a parallel efficiency of 60% and 
50% on Intel® Xeon® Sandy Bridge and AMD Opteron™ 
Magny-Cours systems, respectively, using these techniques. 
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1. INTRODUCTION 

Finite element/volume methods (FEM/FVM) are commonly 
used in the numerical solution of partial differential equa¬ 
tions (PDEs). Unstructured meshes, where the spatial do¬ 
main has been discretised into simplices (i.e. triangles in 
2D, tetrahedra in 3D), are of particular interest in applica¬ 
tions where the geometric domain is complex and structured 
meshes are not practical. Simplices are well suited to vary¬ 
ing mesh resolution throughout the domain, allowing for lo¬ 
cal coarsening and refinement of the mesh without hanging 
nodes. On the other hand, this flexibility introduces compli¬ 


cations of its own, such as management of mesh quality and 
computational overheads arising from indirect addressing. 


Computational mesh resolution is often the limiting factor in 
simulation accuracy. Being able to accurately resolve physi¬ 
cal processes at the small scale coupled with larger scale dy¬ 
namics is key to improving the fidelity of numerical models 
across a wide range of applications (e.g. 15, 20 ). A diffi¬ 

culty with mesh-based modelling is that the mesh is gener¬ 
ated before the solution is known, however, the local solution 
error is related to the local mesh resolution. Overly coarse 
meshes lead to low accuracy whereas over-refined meshes 
can greatly increase the computational cost. 


Mesh adaptation methods provide an important means to 
minimise computational cost while still achieving the re¬ 
quired accuracy |16[ [l2] . In order to use mesh adaptation 
within a simulation, the application code requires a method 
to estimate the local solution error. Given an error estimate 
it is then possible to compute a solution to a specified error 
tolerance while using the minimum resolution everywhere in 
the domain and maintaining element quality constraints. 


Previous work has described how adaptive mesh methods 
can be parallelised for distributed-memory systems using 
MPI (e.g. [l2] 10 ). However, there is a continuous trend 
towards an increasing number of cores per compute node in 
the world’s most powerful supercomputers and it is assumed 
that the nodes of a future exascale system will each contain 
thousands of cores [T] . Therefore, it is important that algo¬ 
rithms are developed with very high levels of parallelism and 
using thread-parallel programming models, such as OpenMP 
5], that exploit the memory hierarchy. However, irregu¬ 
lar applications are hard to parallelise effectively on shared- 
memory architectures for reasons described in [13] . 

In this article we take a fresh look at anisotropic adaptive 
mesh methods and parallelise them using new scalable tech¬ 
niques suitable for modern multicore and manycore architec¬ 
tures. These concepts have been implemented in the open 
source framework PRAgMaTIc (Parallel anisotRopic Adap¬ 
tive Mesh Toolkit Q The remainder of the paper is laid out 
as follows: ^2] gives an overview of the mesh adaptation pro- 


1 http: / / meshadapt at ion. github. io / 





cedure; ^describes the new irregular compute methodology 
used to parallelise the adaptive algorithms; and Q illustrates 
how well our framework performs in 2D and 3D benchmarks. 
We conclude the paper in ^5] 

2. MESH ADAPTIVITY BACKGROUND 

In this section we give an overview of anisotropic mesh adap¬ 
tation, focusing on the element quality as defined by an error 
metric and the adaptation kernels which iteratively improve 
local mesh quality as measured by the worst local element. 

2.1 Error control 

Solution discretisation errors are closely related to the size 
and the shape of the elements. However, in general meshes 
are generated using a priori information about the problem 
under consideration when the solution error estimates are 
not yet available. This may be problematic because (a) so¬ 
lution errors may be unacceptably high and (b) parts of the 
solution may be over-resolved, thereby incurring unneces¬ 
sary computational expense. A solution to this is to compute 
appropriate local error estimates and use them to dynam¬ 
ically control the local mesh resolution at runtime. In the 
most general case this is a metric tensor field so that the 
resolution requirements can be specified anisotropically; for 
a review of the procedure see 11 . 

2.2 Element quality 

As discretisation errors are dependent upon element shape 
as well as size, a number of measures of element quality have 
been proposed, out of which, in the work described here, we 
use the quality functionals by Vasilevskii et al. for trian¬ 
gles [22] and tetrahedra [T], which indicate that the ideal 
element is an equilateral triangle/tetrahedron with edges of 
unit length measured in metric space. 

2.3 Overall adaptation procedure 

The mesh is adapted through a series of local operations: 
edge collapse, edge refinement, element-edge swaps and ver¬ 
tex smoothing. While the first two of these operations con¬ 
trol the local resolution, the latter two are used to improve 
the element quality. Algorithm [I] gives a high level view 
of the anisotropic mesh adaptation procedure as described 
by Li et al. 12]. The inputs are A4, the piecewise linear 
mesh from the modelling software, and S, the node-wise 
metric tensor field which specifies anisotropically the local 
mesh resolution requirements. The process involves the it¬ 
erative application of coarsening, swapping and refinement 
to optimise the resolution and quality of the mesh. The 
loop terminates once the mesh optimisation algorithm con¬ 
verges or after a maximum number of iterations has been 
reached. Finally, mesh quality is fine-tuned using some ver¬ 
tex smoothing algorithm, which aims primarily at improving 
the worst-element quality. Smoothing is left out of the main 
loop because it is an expensive operation and it is found 
empirically that it is efficient to fine-tune the worst-element 
quality once mesh topology has been fixed. 

2.4 Adaptation kernels 

A brief description of the four mesh optimisation kernels 
follows. Figure [l] shows 2D examples to demonstrate what 
each kernel does to the local mesh patch, but the same op¬ 
erations are applied in an identical manner in 3D. For more 


Algorithm 1 Mesh optimisation procedure. 

Inputs: A4, S. 

repeat 

(A4*,<S*) coarsen(M *, <S*) 

(A4*,S*) swap(A4*, S*) 

(M*,S *) <- refine(M*, <S*) 
until (max. number of iterations or convergence) 
<r- smooth(M *, <S*) 
return Ad* 


Coarsening Swapping 



Figure 1: Examples of the four adaptive kernels. 


details on the adaptive algorithms the reader is referred to 
the publications by Li et al. 1^] (coarsening, refinement, 
swapping) and Freitag et al [8]] 9 (smoothing). 

2.4.1 Coarsening 

Coarsening is the process of lowering mesh resolution locally 
by collapsing an edge to a single vertex, thereby removing 
all elements that contain this edge, leading to a reduction in 
the computational cost. 

2.4.2 Refinement 

Refinement is the process of increasing mesh resolution lo¬ 
cally by (a) splitting of edges which are longer than desired 
(as indicated by the error estimation) and (b) subsequent 
division of elements using refinement templates [6]. 

2.4.3 Swapping 

Swapping is done in the form of flipping an edge shared 
by two elements, considering the quality of the swapped ele¬ 
ments - if the minimum quality is improved then the original 
mesh elements are replaced with the edge-swapped elements. 

2.4.4 Smoothing 

The kernel of vertex smoothing relocates a central vertex so 
that the local mesh quality is increased. Common heuristic 
methods are the quality-constrained Laplacian smoothin g [8] 
and the more expensive optimisation-based smoothing M3 

2.4.5 Propagation 

The operations of coarsening, swapping and smoothing of¬ 
ten need to be propagated to the local mesh neighbour¬ 
hood. When a kernel is applied onto an edge/vertex, neigh¬ 
bouring edges/vertices need to be reconsidered for process¬ 
ing because the topological/geometrical changes that oc¬ 
curred might give rise to new configurations of better quality. 
Therefore, these adaptive algorithms keep sweeping over the 
mesh until no further changes are made. 







3. IRREGULAR COMPUTE METHOD 

To allow fine grained parallelisation of mesh adaptation we 
based our methodology on graph colouring, following a pro¬ 
posal by Freitag et al. [To]. However, while this approach 
avoids updates being applied concurrently to the same neigh¬ 
bourhood, data writes will still incur significant lock and 
synchronisation overheads. For this reason we incorporate 
a deferred update strategy, described below, to minimise 
synchronisations and allow parallel writes. Additionally, we 
make use of atomic operations to create parallel worklists 
in a synchronisation-free fashion and, finally, try to balance 
the workload among threads using work-stealing [5]. 

3.1 Hazards 

There are two types of hazards when running mesh optimi¬ 
sation algorithms in parallel: topological hazards ; and data 
races. The former refers to the situation where an adaptive 
operation results in invalid or non-conforming edges and el¬ 
ements. For example in coarsening, if some vertex Vb col¬ 
lapses onto another vertex Fa , then Va cannot collapse onto 
some other vertex at the same time. Data races can oc¬ 
cur when two threads try to update the same adjacency list 
of a vertex concurrently. For example in coarsening, two 
neighbours of some vertex Va can collapse onto Va concur¬ 
rently, then Fa’s adjacency list has to be updated to reflect 
the changes made by the coarsening operations. Concurrent 
access to Fa’s adjacency list may lead to race conditions. 

3.2 Colouring 

Topological hazards for all adaptive algorithms are avoided 
by colouring a graph whose nodes are defined by the mesh 
vertices and edges are defined by the mesh edges. The adap¬ 
tive algorithm then processes the mesh in batches of inde¬ 
pendent sets. The fact that topological changes are made 
to the mesh means that colouring is invalidated frequently 
and the mesh has to be re-coloured before proceeding to the 
next iteration of the adaptive algorithm. Therefore, we need 
to use a fast and scalable colouring algorithm (see [Ts]). 

3.3 Deferred Update 

Colouring does not eliminate data races when updating ad¬ 
jacency lists. A 2-distance colouring was not considered here 
as it is expensive and increases the chromatic number, effec¬ 
tively reducing the exposed parallelism. Instead, in a shared- 
memory environment with N threads, each thread allocates 
a private collection of N lists. When the adjacency list of 
some vertex Vi has to be updated, the thread executing the 
adaptive kernel does not commit the update immediately; 
instead, it pushes the operation back into the list for thread 
tid = ID(Vi)%N , where ID (Vi) is the integer identifier of 
Vi. After processing an independent set and before pro¬ 
ceeding to the next one, every thread Ti visits the private 
collections of all threads, locates the list reserved for Ti and 
commits the updates stored there. This way, it is guaranteed 
that one and only thread will update the adjacency list of 
any given vertex. We call this technique the deferred update. 
Code Snippet [l] demonstrates a typical usage scenario. An 
important advantage of this strategy is that we always read 
the most up-to-date data when executing an adaptive kernel 
(as if we used an “as we go” write-back scheme), eliminating 
the risk of mesh data corruption in coarsening, refinement 
and swapping and having a faster-converging Gauss-Seidel- 
style iteration process in smoothing. 


typedef std::vector<Updates> DefUpdList; 
int N = omp_get_max_threads ( ) ; 

// N collections of deferred—update lists 

std : : vector< s t d : : ve c t or <De f UpdL i s t > > defUpd(N); 

#pragma omp parallel 

{ 

int tid = omp_get_thread_num(); 

// Allocate one list for each thread. 
defUpd [tid] . resize(N ) ; 

// Process the independent set in parallel 
#pragma omp for 

for(int i=0; i<nVe r t i c e s I nS e t ; ++i ) { 
update = execute.kernel(i ) ; 

// To be committed by thread i%N. 
defUpd [tid] [i%N ] . push_back(update ); 

} 

// Commit updates tid is responsible for. 
for(int i=0; i<N ; +-(-i) 

commit_all_updates(defUpd [i] [tid ] ); 

// Proceed to the next independent set ... 

} 


Code Snippet 1: Example of the deferred update scheme. 


3.4 Worklists and Atomic-Capture 

There are many cases where it is necessary to create a work- 
list of items which need to be processed, e.g. for propagation 
of adaptive operations. New work items generated locally by 
a thread need to be accumulated into a global worklist over 
which the next invocation of the adaptive kernel will iterate. 
The classic approach based on prefix sums [5 requires thread 
synchronisation and was found limiting in terms of scalabil¬ 
ity. A better method is based on atomic fetch-and-add on a 
global integer which stores the size of the worklist needed so 
far. Every thread increments this integer atomically while 
caching the old value. This way, the thread knows where to 
copy its private data and increments the integer by the size 
of this data, so the next thread to access the integer knows in 
turn where to copy its private data. An example of using this 
technique via OpenMP’s atomic-capture clause [l4] is given 
in Code Snippet [2] where it is shown that no thread syn¬ 
chronisation is needed to generate the global worklist (note 
the nowait clause). The overhead/spinlock associated with 
atomic-capture operations was found to be insignificant. 

3.5 Work-stealing 

Work-stealing [3] is a sophisticated technique aiming at bal- 
ancing workload among threads while keeping scheduling 
overhead as low as possible. For-loop scheduling strategies 
provided by the OpenMP runtime system were found to be 
inadequate, either incurring significant scheduling overhead 
or leading to load imbalances. As of version 4.0, OpenMP 
does not support work-stealing for parallel for-loops so we 
created a novel scheduler |[l9] which differs from other pro¬ 
posals in two ways: it engages a heuristic to help the thief 
find a suitable victim to steal from; and uses POSIX signal¬ 
s/interrupts to accomplish stealing in an efficient manner. 

4. EXPERIMENTAL RESULTS 

We will show adaptivity results for viscous fingering in 2D 
and structural compliance minimisation in 2D and 3D, fol¬ 
lowed by performance evaluation. 
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1 int worklistSize = 0; 

2 std::vector<Item> globalWorklist(preal1oc_size); 

3 

4 #pragma omp parallel 

5 { 

6 std :: vector <Item> pr i v at e _ 1 i s t ; 

7 

8 ^pragma omp for nowait 

9 for(all items which need to be processed)! 

10 new.item = do.some.work(); 

11 private_list .push.back(new.item ) ; 

12 } // No need to synchronise at end of loop. 

13 

14 int idx ; 

15 #pragma omp atomic capture 

16 { 

17 idx = worklistSize; 

18 worklistSize += private_1ist . size (); 

19 } 

20 

21 memcpy (&globalWorklist [idx] , &private_list [0] , 

22 pr i vat e _ 1 i s t . s ize () * s i z e o f ( 11 em ) ) ; 

23 } 


Code Snippet 2: Creating a worklist using atomic-capture. 



Figure 2: The initial condition for the viscous fin¬ 
gering (left) and a snapshot of a simulation (right). 


4.1 Viscous Fingering 

Viscous fingering is a limiting process for enhanced oil re¬ 
covery technologies. It happens whenever one fluid displaces 
another with a higher viscosity [l7], typically in a porous 
media. A typical setup and simulation is shown in Figure 
[5] with the blue fluid having a viscosity e 2 times lower than 
the red fluid. The initial saturation is unperturbed and it 
is thus the length scale of the initial mesh that triggers the 
instability. Mesh adaptation is driven by the Hessian of the 
pressure combined with the Hessian of the saturation 0 [ 4 ]. 

4.2 Structural Optimisation 

Structural compliance minimisation is concerned with the 
problem of finding stiff and lightweight mechanical compo¬ 
nents [ 2 I], often in the context of linear elasticity. The setup 
for a classical cantilever problem with support to the left 
and a load to the right is shown in Figure [ 3 ] (top left). The 
question is how to form the stiffest possible link between 
the two boundaries, given a certain amount of isotropic ma¬ 
terial. The problem is ill-posed unless a minimum length 
scale is imposed for the design, because the optimal struc¬ 
ture is a composite. In fact, one can see a tendency towards 
microstructured areas when a small minimum length scale 
T m in = 10 -3 L c har is used as illustrated in Figure pi (bottom 
left). Note how the many straight and parallel connections 



Figure 3: The setup for structural compliance min¬ 
imisation (top left) and the result for the case of 
a small minimum length scale (bottom left). Red 
corresponds to solid areas and blue to void. The 
result of compliance minimisation in 3D is shown in 
terms of the cross-sectional view (top right) and the 
solid/void interface (bottom right). 


can be efficiently resolved with anisotropic elements. Mesh 
adaptation is driven by the Hessians of the design and the 
topological derivative [21 . A Helmholtz filter is applied to 
both design and derivative to smooth out features smaller 
than L m in, before the Hessians are computed. 

The two dimensional setup is also extruded to three dimen¬ 
sions, as plotted in Figure [ 3 ] (top right and bottom right). 
Note that the large planar areas with little curvature are 
well resolved by the anisotropic elements. The increased di¬ 
mensionality leads to a much simpler topology even though 
the optimisation is performed with L m i n = 5 • 10 _3 L c har- 

In order to evaluate the parallel performance, a synthetic 
solution ^ is defined to vary in time and space: 

( 27rt\ 

^(x, y , t) = 0.1 sin ( 50x -\—— I + arctan 

where T is the period. This is a good choice as a benchmark 
as it contains multi-scale features and a shock front, i.e. the 
typical solution characteristics where anisotropic adaptive 
mesh methods excel. An isotropic mesh was generated on 
the unit square using approximately 200 x 200 triangles and 
the adaptation benchmark was run with a requirement for 
~ 550 k elements. The same example was extruded in 3D, 
where an isotropic mesh was generated in the unit cube us¬ 
ing approximately 50 x 50 x 50 tetrahedra and the adap¬ 
tation benchmark was run with a requirement for « 210 k 
elements. 3D swapping has not been parallelised, therefore 
the corresponding results have been omitted. 

The code was compiled using the Intel compiler suite (ver¬ 
sion 14.0.1) and with the -03 optimisation flag. We used two 
systems to evaluate performance: (a) a dual-socket Intel® 
Xeon® E5-2650 system (Sandy Bridge, 2.00GHz, 8 cores per 
socket, 2-way hyper-threading) running Red Hat®Enterprise 
Linux® Server release 6.4 (Santiago) and (b) a quad-socket 
AMD Opteron™ 6176 system (Magny-Cours, 2.3GHz, 12 
cores per socket) running Ubuntu 12.04.5. In all cases, 
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thread-core affinity was used. Figures [4] and [5] show the 
average (over 50 time steps) execution time per time step 
and parallel efficiency against the number of threads using 
single-socket (SS), dual-socket (DS) and quad-socket (QS) 
configurations. On the Intel®Xeon® system we also enable 
hyper-threading (HT) to make use of all 40 logical cores. 

Running on one socket, our code achieves a parallel efficiency 
of over 60% on Intel®Xeon® and around 50% on AMD 
Opteron™. Smoothing scales better than the other algo¬ 
rithms as it is more compute-intensive, which favours seal- 
ability, reaching an efficiency of over 50% even in the quad- 
socket case. When we move to more sockets, NUMA effects 
become pronounced, which is expected as common NUMA 
optimisations such as pinning and first-touch for page bind¬ 
ing are ineffective for irregular computations. Nonetheless, 
the achievable efficiency is good considering the irregular 
nature of those algorithms. 

5. CONCLUSION 

In this paper we examined the scalability of anisotropic mesh 
adaptivity using a thread-parallel programming model and 
explored new parallel algorithmic approaches to support this 
model. Despite the complex data dependencies and inher¬ 
ent load imbalances we have shown it is possible to achieve 
practical levels of scaling using a combination of a fast graph 
colouring technique, the deferred-update strategy, atomic- 
based creation of worklists and for-loop work-stealing. In 
principle, this methodology facilitates scaling up to the point 
where the number of elements of an independent set is equal 
to the number of available threads. 
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Figure 5: Execution time and parallel efficiency of 
2D and 3D synthetic benchmarks on the 4xl2-core 
AMD Opteron™ Magny-Cours system. 
























